1. INTRODUCTION

Sales are the lifeline of every business, and sales prediction has a significant impact on companies.
Accurate predictions help an organization maintain its standards and improve its position by
using different strategies [1]. Typically, a prediction builds on knowledge from previous studies, with
close attention to the conditions, and then accounts for various factors, including customers' tastes, culture,
and the marketplace. In short, our prediction depends on the results of previous studies
[2]. Every business needs to be profitable, and profit means not only maximizing stock sales but
also avoiding excess stock. Every retailer must maintain stock according to demand and identify the
flaws and drawbacks that drive sales down. The proposed study therefore addresses this problem by
predicting the store's sales [3].

This paper divides the prediction into several phases, beginning with data preprocessing.
First, we explore the data and impute its missing values using statistics; then we use
feature engineering to derive additional variables so that the data are ready to pass to the models. It is good
practice to split the data into training and test sets with a chosen ratio, and scikit-learn provides a reliable
way to split the data in the desired ratio. This step helps to avoid overfitting and underfitting. Afterwards, we
apply machine learning models to predict the results: we call the regressor of each method, predict, and
compare the predictions with the test values. We also consider factors such as weather, culture, and events,
which affect sales [1].
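
The splitting step described above can be illustrated with a minimal sketch. It uses only the Python standard library and stands in for the scikit-learn split function mentioned in the text; the helper name `split_train_test`, the 30% test ratio, and the seed are assumptions made for illustration, not the settings used in this study.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical row identifiers, for illustration only.
rows = list(range(10))
train, test = split_train_test(rows, test_ratio=0.3)
print(len(train), len(test))
```

Shuffling before splitting keeps the test set representative when the file is ordered, for example by store.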

Any prediction requires a large amount of data. To forecast sales, we need statistical models
such as regression techniques and random forest models [4]. To achieve higher accuracy, we need
models that combine different approaches; many hybrid models give better prediction results [2],
[5]. Xia and Wong [3] highlight the difference between classical and modern methods. Most classical models
are linear and therefore do not capture the asymmetry of real-world sales data. In the present case, these
issues are assessed using root mean square error (RMSE) [6] and mean absolute error (MAE), which are
employed as the performance measures of this study. We ran several machine learning models one by one
and compared their scores to check the models' performance. We follow the general data science workflow:
we form hypotheses about the data, explore the data, impute the missing values, perform feature engineering,
and then select the model that best fits the data set. Equations (1) and (2) define the MAE and RMSE
performance measures, respectively.


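\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| X_{\mathrm{predict},i} - X_{\mathrm{actual},i} \right| \tag{1} \]

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( X_{\mathrm{predict},i} - X_{\mathrm{actual},i} \right)^{2}} \tag{2} \]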
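
As a worked illustration of the two performance measures, the sketch below computes MAE and RMSE in plain Python. The function names and the three sales figures are hypothetical and are not taken from the data set used in this study.

```python
import math


def mae(predicted, actual):
    """Mean absolute error: average of |prediction - actual|."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean square error: square root of the mean squared difference."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical sales figures, for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(mae(predicted, actual))
print(rmse(predicted, actual))
```

Because RMSE squares the errors before averaging, the single error of 30 weighs more heavily in RMSE than in MAE, so RMSE is never smaller than MAE on the same data.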


This paper is organized as follows: Section 2 presents a comprehensive literature review and discusses
related work. Section 3 describes the characteristics and nature of the data set utilized and discusses a
comparison performed over the data set based on the results of three algorithms. Section 4 provides the
model performance evaluation, and finally, Section 5 concludes the paper.


2. LITERATURE REVIEW 

Sales are the critical factor for every big mart or store chain. Sales have two faces, profit and
loss. Therefore, to maintain the mart's standard, sales have to be good every year. This is only
possible by forecasting sales correctly or at least approximately. In this literature review, we survey
previous research and the methods that researchers have used to predict sales as accurately as possible.
Ait-alla et al. [7] suggested a scientific framework for the manufacture and design of a
robust clothing distributor. The authors' focal point is decision-making power at various distribution centers.
Another work proposed a model and claimed that it performs robustly and accurately according to
customer demand. A different mechanism is proposed for machine learning in different fields [8]. Langley
and Simon [9] presented a data mining technique for business called rule induction.
They give a solution based on mining data.

On the other hand, the authors of [10] describe distribution in pharmaceutical firms. That paper
focuses on two major issues: ensuring that stock never runs out, and customer
satisfaction with respect to sales. Punam et al. [11] proposed a model based on two-level statistical
analysis that helps forecast store sales. In another study [12], the author uses a neural network
to predict a store's weekly clog sales. This forecasting helps the owner balance the stock regularly. Das and
Chaudary [12] proposed a model based on a linear and non-linear statistical framework for forecasting
trading. They use both kinds of regression in the model and separate the results based on the given data sets.

Additionally, many researchers study sales prediction. Beheshti-Kashi et al.
[13] work on the fashion market and predict sales using general prediction algorithms. Asur and
Huberman [14] study the famous blogging site "Twitter" and focus on the box-office revenue of
movies. Another study, by O’Connor [15], addresses the music field. The authors train their
model to check whether chatter matters in the selection of music and which type of music gives the best sales.
Bermingham and Smeaton [16] analyze political affairs on Twitter and predict election results based on
posts. Stock exchange prediction is discussed in [17]. Some papers are not related
to sales prediction; however, to understand forecasting in general, we read [18], which forecasts long-term
electric power generation based on hourly monitoring and the previous year's record.

Similarly, another study finds that the rental demand for bicycles depends on the
weather, and the researchers use a regression model for this purpose [19]. The right location gives an extra
advantage to the sales of retail stores. Karna et al. address the selection of the right location for
a store and its impact on monthly sales [20]. There are many factors that contribute to forecasting. Data is directly


BMSP-ML: big mart sales prediction using different machine learning techniques (Rao Faizan Ali) 


876 0 ISSN: 2252-8938 


proportional to the prediction; the prediction will be wrong if the data set is incorrect. Given the variability of
items, it is difficult to predict sales. There are many factors that give a negative result in the case of
uncertainty. Yuan et al. [21] deal with effective e-commerce sales and discuss the design and
management that help predict sales based on user behavior. They proposed a model that
focuses on predicting current sales and tries to find the best relation between the customer's choice and stock
selection. Fakharudin et al. [22] help publishers overcome the problem of printing the
right quantity of newspapers. They use a neural network trained on
previous newspaper data sets, and the confusion matrix of their model shows that the
forecast and original values are strongly correlated. Data mining, machine learning, and data visualization
are significant components of almost every data science study, which is why we read many
research papers related to these components. One of them applies data mining to urban traffic
prediction, which we used in our research [23]. That paper focuses on classification models and studies traffic
behavior to predict traffic conditions.

Even for small businesses, sales analysis is not an easy task. The study in [24] is based on the inspection
of customers' bargains and updates the critical adjustments based on the analysis of the customers' bargain
data. The authors add a forum to a mobile application so that customers can buy
directly from sellers. Nagamma et al. [25] investigate the relationship between ticket booth
revenue and a film's online viewers. They use a support vector machine (SVM) classification model to train
on the online viewers and predict movie revenue. Kim and Youn, researchers from South Korea, use the Huff
model to predict agricultural product sales [26]. Their research is based on the value of A, which is reversed in
the case of 1, and store mass is directly affected by travel time if the value of A is below 0.5.


3. METHODOLOGY 

Forecasting mart sales is an interesting problem, and data science can help improve sales and business.
We divide our proposed research method into the following steps: i) hypothesis making, ii) data
exploration, iii) data cleaning, and iv) feature engineering. To build the model, we first
examine the dataset hypothetically and extract all possible information to understand the data and the
research problem. Next, we explore the attributes of the data and their values. Missing values are a
manageable problem in data science, so we clean our data before feature engineering. We may
need variables that are not present in the data, so we create new variables for better
understanding, and finally, we build a model to predict sales.


3.1. The problem assertion 

We have collected data on 1560 items from ten stores because we want to predict the sales of a given
product in different stores and find the reasons for the differences in sales. The key task is to determine the
attributes that increase sales. Let us consider some hypotheses about the factors on which sales may depend.


3.2. Mart hypothesis 

— According to the general hypothesis, store sales depend on whether the store is in an urban or a rural
area.

— Another factor is population; a larger population is expected to increase sales.

— The size of a store may also affect sales; a larger store conveys an impression of quality
to the customer.

— Competition: how many competitors are in the market.

— Advertising is also a pillar of sales.

— Location matters: sales benefit if the store is near a dense population.

— Customer satisfaction.


3.3. Product hypothesis 
— People always need quality products associated with known brands.
— Good packaging design also conveys an impression of quality to the customer.
— Whether the store carries relevant products, i.e., stores should stock products for daily use.
— Discounts and promotions also attract customers.
— Product price.
These are some of the hypotheses we considered important, but the list is not exhaustive. There are many
other factors on which sales depend, but we list those identified in our research. These hypotheses are for a
better understanding of the data. The dataset details are provided in the next section.


3.4. Data exploration

In this step, we explore our data by comparing the data set's attributes with our hypotheses and
checking the differences, as given in Table 1 and Figure 1. We conclude that the original data differ slightly
from our hypotheses: some attributes appear in the original data but not in the hypotheses,
and the other way round. We obtained our data by web scraping. The data are split between two files, train and
test, so we concatenate them for exploration and cleaning and split them again before the modelling step. For
this, we use the concat function and shape to check the size before and after concatenation; the shapes of the
separate train and test files differ from the shape of the concatenated data. The data set is then ready for
cleaning. First, we check for missing values in the dataset and count the missing entries in each attribute.
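
The concatenation and missing-value count described above can be sketched as follows. This is a standard-library illustration rather than the code used in the study; the helper names, the column names, and the four records are hypothetical.

```python
def concat_with_source(train_rows, test_rows):
    """Stack train and test records, tagging each with its origin."""
    combined = []
    for source, rows in (("train", train_rows), ("test", test_rows)):
        for row in rows:
            tagged = dict(row)
            tagged["source"] = source
            combined.append(tagged)
    return combined


def count_missing(rows):
    """Count absent or None values per attribute across all records."""
    columns = sorted({key for row in rows for key in row})
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}


# Hypothetical records; test rows carry no sales value.
train_rows = [
    {"product_weight": 9.3, "outlet_size": "Medium", "outlet_sales": 3735.1},
    {"product_weight": None, "outlet_size": None, "outlet_sales": 443.4},
]
test_rows = [
    {"product_weight": 20.7, "outlet_size": "Medium"},
    {"product_weight": None, "outlet_size": "Small"},
]
data = concat_with_source(train_rows, test_rows)
print(len(train_rows), len(test_rows), len(data))
print(count_missing(data))
```

In this toy example, every test record counts as missing for the sales column because the test file has no target values. The `source` tag makes it possible to split the combined data back into train and test later.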


Table 1. Attributes vs hypothesis 


| Attributes | Explanation | Hypothesis link |
|---|---|---|
| Product ID | Product identification | Unique identification |
| Product W | Product weight | Absent in hypothesis |
| Product Ingredient | Good quality ingredients | Relation with quality |
| Product Display | How prominently the product is displayed | Present in hypothesis, related to the display |
| Product Type | Product category | Absent in our hypothesis |
| MRP | Price of the product | Present in hypothesis |
| Year of Establishment | How old the store is | Absent in our hypothesis |
| Size | How large the store is | Related to the capacity of the store |
| Location | Whether the store is in a rural or urban area | Present in hypothesis |
| Product sale | The monthly average sale of a product | Present in hypothesis |


Outline of data for now: 


Figure 1. Effects of selecting different switching under dynamic conditions 


We have two types of variables when predicting any value: target and predictor variables. In our case, we
have to predict the outlet sales, so our target variable is outlet sales. From Table 2, we can see that this
attribute has 5680 missing values, which have to be handled before building a model to predict sales as
accurately as possible. The describe() function is used for statistical understanding, for example to check
the minimum, maximum, and mean of each attribute. The exploratory data analysis used in this study is
presented in Table 3.
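
To make the summary statistics concrete, the sketch below reproduces the kind of figures reported by a describe-style summary (count of non-missing values, mean, sample standard deviation, minimum, quartiles, maximum) using the standard library. The helper name `describe` and the weight values are hypothetical, and the linear-interpolation quartile method is an assumption about how such summaries are usually computed.

```python
import statistics


def describe(values):
    """Summary statistics over the non-missing entries of a numeric column."""
    present = [v for v in values if v is not None]
    # "inclusive" interpolates linearly between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(present),
        "mean": statistics.fmean(present),
        "std": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(present),
    }


# Hypothetical product weights with one missing entry.
weights = [4.0, 8.0, None, 12.0, 16.0, 20.0]
summary = describe(weights)
for name, value in summary.items():
    print(name, value)
```

A count lower than the number of rows, as for product weight and outlet sales in Table 3, is a quick indicator of missing values in that column.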

Additionally, the exploratory data analysis provides statistical information,
some of which is vital to understand. For example, the minimum product appearance is zero, which is
implausible: if a product is being sold, its appearance cannot be zero. If we express the year of
establishment as the number of years since establishment, the range of years becomes more informative.
Besides, we know that there are 1559 items in ten stores and sixteen
unique product types. We ignore the product identifier and source because these attributes cannot predict
sales. The frequencies of selected nominal variables from the employed dataset are shown in
Table 4, which lists the occurrence of each category of the variable Product_FAT_Content.

The information in section 3.4 tells us more about the data. Many values denote the same
category, e.g., LF stands for Low Fat and Reg for Regular. Some categories do not have substantial counts, so it
is better to combine them for more useful outcomes. Now our data is ready for the cleaning process.
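
The merging of equivalent labels can be sketched with the frequencies from Table 4. The mapping below is an assumption about which spellings denote the same category; it is illustrative rather than the exact recoding used in the study.

```python
from collections import Counter

# Assumed spelling variants, based on the labels listed in Table 4.
CANONICAL = {
    "LowFat": "Low Fat",
    "LF": "Low Fat",
    "Low fat": "Low Fat",
    "Regular": "Regular",
    "Reg": "Regular",
}


def merge_labels(counts):
    """Sum the frequencies of labels that map to the same canonical name."""
    merged = Counter()
    for label, n in counts.items():
        merged[CANONICAL.get(label, label)] += n
    return merged


table4 = {"LowFat": 8484, "Regular": 4825, "LF": 521, "Reg": 196, "Low fat": 177}
merged = merge_labels(table4)
print(dict(merged))
```

After merging, the five raw labels collapse into two categories, which avoids treating spelling variants as separate levels of the same variable.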


3.5. Data cleaning 

Outliers and missing values significantly affect the data, so treating outliers and handling missing
values is essential. Data cleaning improves the dataset by resolving these issues. Two essential
attributes in our dataset have missing values that need to be filled in. The code we used reports the number
of missing values before and after imputation. For example, product weight has 2439 missing values before
and zero after imputation, so the output confirms that no missing values remain. The other attribute
with missing values, Outlet_size, is filled using the mode as the aggfunc, computed for each
"Outlet_Type"; it also ends with zero missing values, as shown in Table 5. Now, our
data is ready for feature engineering, which is discussed in the next section.
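
The two imputation rules, mean for product weight and per-type mode for outlet size, can be sketched as follows. This is a standard-library illustration under assumed column names; the four records are hypothetical and the helpers are not the code used in the study.

```python
import statistics
from collections import defaultdict


def impute_mean(rows, column):
    """Replace None in `column` with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = statistics.fmean(observed)
    for r in rows:
        if r[column] is None:
            r[column] = fill
    return fill


def impute_group_mode(rows, column, group_column):
    """Replace None in `column` with the most common value within each group."""
    observed = defaultdict(list)
    for r in rows:
        if r[column] is not None:
            observed[r[group_column]].append(r[column])
    modes = {group: statistics.mode(vals) for group, vals in observed.items()}
    for r in rows:
        if r[column] is None:
            # Raises KeyError if a group has no observed value at all.
            r[column] = modes[r[group_column]]
    return modes


# Hypothetical records, for illustration only.
rows = [
    {"weight": 10.0, "outlet_type": "GroceryStore", "outlet_size": "Small"},
    {"weight": None, "outlet_type": "GroceryStore", "outlet_size": None},
    {"weight": 14.0, "outlet_type": "SuperMarket 2", "outlet_size": "Medium"},
    {"weight": None, "outlet_type": "SuperMarket 2", "outlet_size": None},
]
mean_weight = impute_mean(rows, "weight")
size_modes = impute_group_mode(rows, "outlet_size", "outlet_type")
remaining = sum(1 for r in rows for v in r.values() if v is None)
print(mean_weight, size_modes, remaining)
```

Counting the remaining missing entries after imputation mirrors the before-and-after comparison described in the text.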


Table 2. Checking for missing values in the dataset 


| Attribute | Missing values |
|---|---|
| Fat_Content | 0 |
| Product Identifier | 0 |
| MRP | 0 |
| Outlet sale | 5680 |
| Product Type | 0 |
| Product Display | 0 |
| Weight of Product | 2440 |
| Year of Establishment | 0 |
| Outlet Identifier | 0 |
| Location | 0 |
| Size | 4015 |
| Outlet Type | 0 |
| Source | 0 |


Table 3. Data exploratory analysis 


| Statistic | Product MRP | Product outlet sales | Product appearance | Product weight | Establishment year |
|---|---|---|---|---|---|
| Count | 14204.0000 | 8523.0000 | 14204.0000 | 11765.000 | 14204.0000 |
| Mean | 141.004977 | 2181.288914 | 0.065953 | 12.792854 | 1997.830681 |
| Std | 62.086938 | 1706.499616 | 0.051459 | 4.652502 | 8.371664 |
| Min | 31.29000 | 33.290000 | 0.00000 | 4.55500 | 1985.0000 |
| 25% | 94.012000 | 834.247400 | 0.027036 | 8.710000 | 1987.0000 |
| 50% | 142.247000 | 1749.331000 | 0.054021 | 12.600000 | 1999.0000 |
| 75% | 185.855600 | 3101.296400 | 0.094037 | 16.750000 | 2004.0000 |
| Max | 266.888400 | 13086.964800 | 0.328391 | 21.350000 | 2009.0000 |


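Table 4. Frequencies of different nominal variables


| Category | Frequency |
|---|---|
| LowFat | 8484 |
| Regular | 4825 |
| LF | 521 |
| Reg | 196 |
| Low fat | 177 |


Table 5. Mode for each Outlet_Type


| Outlet_Type | Mode of Outlet_Size |
|---|---|
| GroceryStore | Small |
| SuperMarket 1 | Small |
| SuperMarket 2 | Medium |
| SuperMarket 3 | Medium |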


3.6. Feature engineering 

Feature engineering helps us understand the data for better analysis. Here, we create some new
variables from the original data. While exploring the data, we observed several nuances that have to
be resolved. For example, we considered combining two of the supermarket types because we
assumed that both have similar sales, so we check whether this is true. The output shows a significant
difference between the sales of the supermarket types, which is why combining their data would not be
valuable.
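
The check described above amounts to comparing the mean sales of each outlet type. A minimal sketch, with hypothetical outlet types and sales values, is:

```python
import statistics
from collections import defaultdict


def mean_by_group(rows, group_column, value_column):
    """Average `value_column` within each distinct value of `group_column`."""
    groups = defaultdict(list)
    for r in rows:
        if r[value_column] is not None:
            groups[r[group_column]].append(r[value_column])
    return {group: statistics.fmean(vals) for group, vals in groups.items()}


# Hypothetical sales records, for illustration only.
rows = [
    {"outlet_type": "SuperMarket 2", "sales": 1800.0},
    {"outlet_type": "SuperMarket 2", "sales": 2200.0},
    {"outlet_type": "SuperMarket 3", "sales": 3500.0},
    {"outlet_type": "SuperMarket 3", "sales": 3900.0},
]
means = mean_by_group(rows, "outlet_type", "sales")
print(means)
```

When the group means differ as clearly as in this toy example, merging the two types would discard information that helps the model.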

After data cleaning and data wrangling, we build our predictive model, which
plays a vital role in predicting sales. Our research uses six models, including decision tree, random forest,
and linear regression. We start with a baseline model. The baseline requires no
forecasting model: we use the mean of all sales as the forecast, i.e., we predict by taking the average
of all mart sales. On the other hand,
linear regression is a useful and essential prediction model, and the scikit-learn library provides many
machine learning models, including regression models. Figure 2 shows the linear regression
model results. The public leaderboard score is now 1202, which is better than the baseline
model; nevertheless, many coefficients are large in magnitude, so we also try a ridge model.
Its public leaderboard score is 1203, which is comparable to that of linear regression. Figure 3 shows the
ridge model results with their mean and the attributes of the selected features.
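
The baseline and the effect of the ridge penalty on coefficient size can be illustrated with a single-feature sketch. It uses the closed-form solution for one feature with an unpenalized intercept, not scikit-learn; the function names, the penalty value, and the price and sales figures are assumptions made for illustration, and errors are computed on the training data only to keep the example short.

```python
import math
import statistics


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def baseline_predict(train_targets, n):
    """Baseline: predict the training mean for each of n items."""
    return [statistics.fmean(train_targets)] * n


def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge fit for one feature; alpha=0 gives ordinary least squares."""
    x_mean, y_mean = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)  # the penalty shrinks the slope toward zero
    return slope, y_mean - slope * x_mean


# Hypothetical prices and sales, for illustration only.
price = [50.0, 100.0, 150.0, 200.0]
sales = [1000.0, 2100.0, 2900.0, 4000.0]

baseline_rmse = rmse(baseline_predict(sales, len(sales)), sales)
ols_slope, ols_intercept = fit_ridge_1d(price, sales, alpha=0.0)
ridge_slope, ridge_intercept = fit_ridge_1d(price, sales, alpha=12500.0)
ols_rmse = rmse([ols_intercept + ols_slope * p for p in price], sales)
print(baseline_rmse, ols_rmse, ols_slope, ridge_slope)
```

The regression line improves on the mean baseline, and the ridge penalty reduces the magnitude of the coefficient, which is the motivation given in the text for trying a ridge model.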
The decision tree is also a useful model for predictions, which is why we include it in our
study and compare its results with the previous models by visualization. In this experiment, the decision
tree scored 1157 on the public leaderboard. Because this result was not satisfying, we tried another model,
the random forest, which gave more satisfying forecasts, with a public leaderboard score of 1157
reported at this stage.
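
To show what a regression tree does at each node, the sketch below finds the single best split of one feature by minimizing the squared error on both sides. It is a one-level stump written for illustration, not the decision tree or random forest used in the study, and the price and sales values are hypothetical.

```python
import statistics


def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def predict_stump(stump, value):
    """Predict the mean of the side of the threshold that `value` falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if value <= threshold else right_mean


# Hypothetical prices and sales, for illustration only.
price = [40.0, 60.0, 150.0, 170.0]
sales = [500.0, 700.0, 2900.0, 3100.0]
stump = best_split(price, sales)
print(stump, predict_stump(stump, 50.0), predict_stump(stump, 160.0))
```

A full tree applies this search recursively to each side, and a random forest averages many such trees grown on resampled data.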


4. EXPERIMENT RESULTS AND DISCUSSION

Big mart sales prediction is conducted using several algorithms. We apply machine learning
algorithms to our dataset. Initially, we want to predict the sales of the mart by studying the sales of
different marts with specific attributes, so we set the "Price" attribute as the dependent variable,
and, as seen above, more than 10 attributes serve as independent variables. Our dataset
comes as two separate files, train and test; we concatenate them to understand the data better. Hypotheses
are necessary to check the possible attributes of the data, and they also connect the data
scientist's understanding with the prediction task. Therefore, we created some hypotheses and compared them
with the existing data. We saw slight differences between the hypotheses and the data, and then we aligned
the data with our hypothesis attributes. Next, we explored the data. In this part, we check the basic statistics
of the dataset and the missing values in the data.

We found that three attributes have many missing values. Moving to the nominal variables, we checked the
unique values in the data and found that there are 4-5 categorical variables. We then imputed the missing
values: the attributes 'Product-weight' and 'Outlet-Size' had 2439 and 4016 missing values, respectively,
which we filled using the mean for product weight and the mode for outlet size. Feature engineering is the
process of creating new attributes for a better understanding of the data.

Once the data were ready, we built the models. We used six machine learning models
to predict sales and compared the results. The baseline model scored
1772 on the public leaderboard, which was not satisfying. When working with machine learning (ML)
models, it is good practice to split the data into training and test sets, and the built-in function of the
scikit-learn library is a convenient way to do so. Splitting helps to avoid overfitting and underfitting, so we
split the data before applying any machine learning algorithm. The linear regression model scores 1201 on
the public leaderboard. Ridge regression scores 1202, which is comparable to
linear regression. The decision tree gives an RMSE of 1057 and a mean cross-validation (CV) error
of 1090, which suggests that the model is overfitting. As a quick experiment, we retrained the tree on the four
top variables with a maximum depth of eight and a minimum of 151 samples per leaf; the RMSE rose from 1057
to 1070. Random forest scores 1153 on the public leaderboard. Further improvement may be possible by changing
the maximum depth and the number of trees, although more trees increase the computational
burden. Table 6 provides a comparative analysis of the leaderboard values.
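
The overfitting check mentioned above compares the error on the training data with the mean error over held-out folds. The sketch below shows the idea with k-fold cross-validation and a nearest-neighbour predictor, which memorizes the training data and is used here only as a stand-in for an unconstrained tree; the function names, the fold assignment, and the data are hypothetical.

```python
import math
import statistics


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def nearest_neighbour_predict(train_x, train_y, query):
    """Return the target of the closest training point (memorizes the data)."""
    index = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[index]


def k_fold_rmse(x, y, k):
    """RMSE on each of k held-out folds; fold j holds rows j, j + k, j + 2k, ..."""
    scores = []
    for fold in range(k):
        test_idx = [i for i in range(len(x)) if i % k == fold]
        train_idx = [i for i in range(len(x)) if i % k != fold]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        preds = [nearest_neighbour_predict(train_x, train_y, x[i]) for i in test_idx]
        scores.append(rmse(preds, [y[i] for i in test_idx]))
    return scores


# Hypothetical feature and target values, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0, 30.0, 20.0, 50.0, 40.0, 70.0]
train_rmse = rmse([nearest_neighbour_predict(x, y, xi) for xi in x], y)
cv_scores = k_fold_rmse(x, y, k=3)
print(train_rmse, statistics.fmean(cv_scores))
```

A training error well below the mean cross-validation error indicates that the model fits the training data more closely than it generalizes, which is the pattern reported for the decision tree.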


Table 6. A comparative analysis between different proposed methods 


| Model name | Public leaderboard score | RMSE |
|---|---|---|
| Baseline model | 1772 | Not considered |
| Linear regression | 1201 | 1128 |
| Ridge regression model | 1202 | 1129 |
| Decision tree model | 1161 | 1507 |
| Random forest model | 1153 | 1070 |


Additionally, our approach predicts sales based on the sales data of other marts. Our predictions are
close to the original test data. Thus, the method in this study performs well in
predicting sales. Different models may be best in different scenarios. In
our case, the public leaderboard scores of all models are above 1150, so all are good enough to use. Figure 4
provides a comparative analysis between the different proposed methods.

Among all models, the decision tree shows the highest RMSE. Ridge regression has a higher
public leaderboard score than the remaining models. Therefore, the outcomes show the importance of the
results in our research. While the proposed model has many benefits, it also has some limitations. Our research
only predicts sales based on specific attributes, so it is not general enough to use globally; for example, we do
not include disasters in our research, so our predictions are invalid in the case of a disaster. For prediction, we
use several machine learning models and then evaluate and compare their results before final use. Data are not
always in a manageable form, so we need to clean the data before modelling. This paper presents the sequence
of research steps and the outcomes attained with machine learning models. For new retailers, this research
is useful from an investment perspective: it indicates how much income is needed initially to start a new
retail store or big mart and what monthly outcomes to expect.


Figure 4. A comparative analysis between different proposed methods


5. CONCLUSION

This research proposed a framework that predicts a mart's sales using machine learning models and
different techniques. It uses the data of various marts and combines and analyses them so
that any mart can check the overall demand and sales of a product. This forecasting helps retailers set
stock quantities more accurately. We used several models and studied their scores. The chosen approach
is suitable for the present data, but for larger data sets it might be unsuitable, or the model selection might
change. For higher accuracy, we need a large amount of data with minimal outliers. In future work, the authors
seek to examine individual product demand in a particular area. Further, a retailer could
check the score of a specific product by entering the product's attributes and the store's information, such as
location and culture. We also consider, as future work, an online app for customers' reviews of stores and
specific products. This app would work as a ranking app: customers rank the stores by giving feedback,
which helps other customers choose stores. Such a forum would allow checking the demand for a
specific product at a specific store. This kind of portal would also help retailers compare stocks and
public leaderboard scores.